Neural Architecture Search as Program Transformation Exploration
Improving the performance of deep neural networks (DNNs) is important to both
the compiler and neural architecture search (NAS) communities. Compilers apply
program transformations in order to exploit hardware parallelism and memory
hierarchy. However, legality concerns mean they fail to exploit the natural
robustness of neural networks. In contrast, NAS techniques mutate networks by
operations such as the grouping or bottlenecking of convolutions, exploiting
the resilience of DNNs. In this work, we express such neural architecture
operations as program transformations whose legality depends on a notion of
representational capacity. This allows them to be combined with existing
transformations into a unified optimization framework. This unification allows
us to express existing NAS operations as combinations of simpler
transformations. Crucially, it allows us to generate and explore new tensor
convolutions. We prototyped the combined framework in TVM and were able to find
optimizations across different DNNs that significantly reduce inference time,
by over 3x in the majority of cases.
Furthermore, our scheme dramatically reduces NAS search time. Code is
available at
https://github.com/jack-willturner/nas-as-program-transformation-exploration
mlirSynth: Automatic, Retargetable Program Raising in Multi-Level IR using Program Synthesis
MLIR is an emerging compiler infrastructure for modern hardware, but existing programs cannot take advantage of MLIR’s high-performance compilation if they are described in lower-level general purpose languages. Consequently, to avoid programs needing to be rewritten manually, this has led to efforts to automatically raise lower-level to higher-level dialects in MLIR. However, current methods rely on manually-defined raising rules, which limit their applicability and make them challenging to maintain as MLIR dialects evolve. We present mlirSynth – a novel approach which translates programs from lower-level MLIR dialects to high-level ones without manually defined rules. Instead, it uses available dialect definitions to construct a program space and searches it effectively using type constraints and equivalences. We demonstrate its effectiveness by raising C programs to two distinct high-level MLIR dialects, which enables us to use existing high-level dialect specific compilation flows. On Polybench, we show a greater coverage than previous approaches, resulting in geomean speedups of 2.5x (Intel) and 3.4x (AMD) over state-of-the-art compilation flows. mlirSynth also enables retargetability to domain-specific accelerators, resulting in a geomean speedup of 21.6x on a TPU
mlirSynth: Automatic, Retargetable Program Raising in Multi-Level IR using Program Synthesis
MLIR is an emerging compiler infrastructure for modern hardware, but existing
programs cannot take advantage of MLIR's high-performance compilation if they
are described in lower-level general purpose languages. Consequently, to avoid
programs needing to be rewritten manually, this has led to efforts to
automatically raise lower-level to higher-level dialects in MLIR. However,
current methods rely on manually-defined raising rules, which limit their
applicability and make them challenging to maintain as MLIR dialects evolve.
We present mlirSynth -- a novel approach which translates programs from
lower-level MLIR dialects to high-level ones without manually defined rules.
Instead, it uses available dialect definitions to construct a program space and
searches it effectively using type constraints and equivalences. We demonstrate
its effectiveness by raising C programs to two distinct high-level MLIR
dialects, which enables us to use existing high-level dialect specific
compilation flows. On Polybench, we show a greater coverage than previous
approaches, resulting in geomean speedups of 2.5x (Intel) and 3.4x (AMD) over
state-of-the-art compilation flows for the C programming language. mlirSynth
also enables retargetability to domain-specific accelerators, resulting in a
geomean speedup of 21.6x on a TPU
Expert Programmer versus Parallelizing Compiler: A Comparative Study of Two Approaches for Distributed Shared Memory
This article critically examines current parallel programming practice and optimizing compiler development. The general strategies employed by compiler and programmer to optimize a Fortran program are described, and then illustrated for a specific case by applying them to a well-known scientific program, TRED2, using the KSR-1 as the target architecture. Extensive measurement is applied to the resulting versions of the program, which are compared with a version produced by a commercial optimizing compiler, KAP. The compiler strategy significantly outperforms KAP and does not fall far short of the performance achieved by the programmer. Following the experimental section each approach is critiqued by the other. Perceived flaws, advantages, and common ground are outlined, with an eye to improving both schemes
Rewriting History: Repurposing Domain-Specific CGRAs
Coarse-grained reconfigurable arrays (CGRAs) are domain-specific devices
promising both the flexibility of FPGAs and the performance of ASICs. However,
with restricted domains comes a danger: designing chips that cannot accelerate
enough current and future software to justify the hardware cost. We introduce
FlexC, the first flexible CGRA compiler, which allows CGRAs to be adapted to
operations they do not natively support.
FlexC uses dataflow rewriting, replacing unsupported regions of code with
equivalent operations that are supported by the CGRA. We use equality
saturation, a technique enabling efficient exploration of a large space of
rewrite rules, to effectively search through the program-space for supported
programs. We applied FlexC to over 2,000 loop kernels, compiling to four
different research CGRAs and 300 generated CGRAs, and demonstrate a 2.2x
increase in the number of loop kernels accelerated, leading to a 3x speedup
compared to an Arm A5 CPU on kernels that would otherwise be unsupported by the
accelerator